The Transformer Architecture is primarily composed on two blocks:

Encoder: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.

Decoder: The decoder uses the encoder's representation(features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently, depending on the task:

- Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.

- Decoder-only models: Good for generative tasks such as text generation.

- Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.



HOW DOES THE ARCHITECTURE REALLY WORK?

Let's break down the architecture of Transformers into simpler terms:

- Purpose: Transformers were initially created for tasks like translating text from one language to another.

- Two Main Components: The architecture consists of two parts:
  1. Encoder: Processes the input sentence (in the original language).
  2. Decoder: Generates the translated sentence (in the target language).

- How the Encoder Works:
  - The encoder looks at the entire input sentence at once. It considers every word in the sentence because the meaning of a word can depend on the words before and after it.
  
- How the Decoder Works:
  - The decoder generates the translated sentence word by word. However, it can only see the words it has already generated, not the ones it hasn't translated yet. For example, when translating a sentence, if three words have been translated, the decoder uses these three words to predict the next word.

- Training Process:
  - During training, the decoder is given the entire target sentence (the correct translation) to speed up the process.
  - But there’s a catch: The decoder is only allowed to use the words that come before the word it's currently predicting. For instance, when trying to predict the fourth word, the decoder can only look at the first three words, not any future ones.
  
This setup helps the model learn to predict the next word in a sequence based on the context it has seen so far, both from the original sentence (via the encoder) and from what it has already translated (via the decoder).


